This work has been submitted to the IEEE for possible
publication. Copyright may be transferred without notice, 
after which this version may no longer be accessible. 

Abstract — I present several new relations between mutual infor- 
mation (MI) and statistical estimation error for a system that 
can be regarded simultaneously as a communication channel and 
as an estimator of an input parameter. I first derive a second- 
order result between MI and Fisher information (FI) that is 
valid for sufficiently narrow priors, but arbitrary channels. A 
second relation furnishes a lower bound on the MI in terms 
of the minimum mean-squared error (MMSE) on the Bayesian 
estimation of the input parameter from the channel output, one 
that is valid for arbitrary channels and priors. The existence of 
such a lower bound, while extending previous work relating the 
MI to the FI that is valid only in the asymptotic and high-SNR 
limits, elucidates further the fundamental connection between 
information and estimation theoretic measures of fidelity. The 
remaining relations I present are inequalities and correspon- 
dences among MI, FI, and MMSE in the presence of nuisance 
parameters. 

Index Terms — Mutual information, MMSE, Bayesian estimation, 
Fisher information, nuisance parameters 



I. Introduction 

Statistical information theory [1], [2] constitutes an essential
tool for modern signal processing, computation, coding, and 
communication systems. Its core philosophy hinges on the 
notions of information potential and the ability of systems to 
encode, transmit and decode information about one or more 
input parameters. 

Based in statistical estimation theory, Fisher information
(FI) [3], on the other hand, represents the sensitivity of statistical
data to one or more input parameters. Its inverse, the so-called
Cramer-Rao bound (CRB), yields a useful lower bound
on the variance of any unbiased, data-based estimator of those
parameters.

In spite of the different essential motivations for the two
families of information measures, mutual information (MI) and
FI are closely related, at least asymptotically in the limit of
a large number of conditionally independent measurements
[4], [5], [6], [7]. A recent paper explores the validity of this
asymptotic relationship when the number of measurements is
not particularly large [8].

The relation of MI to FI is essentially a local one that is
valid only in the limit of either a narrow channel PDF, as
in Ref. [4], or a narrow input PDF, as we shall see in this
paper. In the more general case, the MI, as I shall also show,
is related more naturally to the minimum mean squared error
(MMSE) of Bayesian estimation. Unlike previous work [9],
[10], [11] on this topic, the new relation, a lower bound on
the MI, is general and applicable to arbitrary channel and input
statistics. It may be regarded as a global generalization of the
more restrictive local relations between MI and FI.

A number of additional correspondences between the MI 
and MMSE are derived that apply when either more measure- 
ments, or channels, are added or multiple input parameters 
must be estimated at once. In the latter case if the input 
parameters are statistically independent, then each parameter 
serves as a nuisance for the other parameters that must, in 
general, reduce both the MI and the fidelity of estimation for 
each parameter. These local and global considerations on the 
fundamental relationship between MI and Bayesian estimation 
error are the subject of this paper. 

II. A Second-Order Relation between MI and FI 

Let X be an input parameter that is statistically distributed
according to the probability density function (PDF) P(x) [12]
with mean $\bar{X}$ and variance $\sigma_X^2$. Let Y be an output variable,
e.g., a measurement variable, that carries information about X, 
and is distributed according to the PDF P(y). For notational 
definiteness, let us take these variables to be continuous over 
appropriate ranges of values, but the analysis of this section 
applies equally well to discrete random variables too, provided 
all integrals over such variables are regarded as discrete sums 
over the corresponding sample spaces. 

The communication channel, or the measurement system as 
the case may be, is described by means of the conditional PDF, 
P(y|x). In spite of the notation, there is no restriction placed
on the number of output variables represented by the symbol. 
In other words, Y is in general a multi-dimensional output 
vector. Although I shall for clarity assume initially that the 
input is one-dimensional, the generalization to multiple input 
parameters, as we shall see subsequently, is straightforward. 

The three PDFs are related according to Bayes' rule,

$$P(y) = \int P(y|x)\, P(x)\, dx. \tag{1}$$



The MI is defined in terms of the various PDFs by three
entirely equivalent expressions,

$$\begin{aligned}
I(X;Y) &= h(X) - h(X|Y) \\
       &= h(Y) - h(Y|X) \\
       &= h(X) + h(Y) - h(X,Y),
\end{aligned} \tag{2}$$

where for each PDF h denotes the corresponding differential
entropy defined by averaging the negative logarithm of the
PDF over the joint PDF, P(x, y),

$$\begin{aligned}
h(X)   &= -\int P(x) \ln P(x)\, dx; \\
h(X|Y) &= -\iint P(x,y) \ln P(x|y)\, dx\, dy; \\
h(X,Y) &= -\iint P(x,y) \ln P(x,y)\, dx\, dy;
\end{aligned} \tag{3}$$

and so on. I shall always use the natural logarithm for the 
definition of entropies in this paper, as it yields the simplest 
form of the final results. All entropy and information measures 
are thus expressed in natural units, or nats. 

By using definitions of form (3) in the second of the
expressions (2) and using the Bayes relation (1), we may
express the MI as the average

$$I(X;Y) = -\mathrm{E}\left[ \ln \int P(x')\, \frac{P(y|x')}{P(y|x)}\, dx' \right]. \tag{4}$$

By expanding P(y|x') in a Taylor series of powers of the
deviation (x' - x), we may transform the logarithmic term in
Eq. (4),

$$\ln \int P(x')\, \frac{P(y|x')}{P(y|x)}\, dx' = \ln\left[ 1 + \sum_{n=1}^{\infty} \frac{\sigma^{(n)}(x)}{n!\, P(y|x)}\, \frac{\partial^n P(y|x)}{\partial x^n} \right], \tag{5}$$

where the x-dependent "moments" of the X-PDF are defined
as

$$\sigma^{(n)}(x) = \int P(x')\, (x' - x)^n\, dx'. \tag{6}$$

By subtracting $\bar{X}$, the mean value of X, from both x' and x
inside the integrand in Eq. (6) and noting that linear deviations
from the mean average to 0, we may easily evaluate the first
two x-dependent moments as

$$\sigma^{(1)}(x) = -(x - \bar{X}); \qquad \sigma^{(2)}(x) = \sigma_X^2 + (x - \bar{X})^2. \tag{7}$$

We may now expand the logarithm (5) to second order in
the deviations and note that

$$\int dy\, P(y|x)\, \frac{1}{P(y|x)}\, \frac{\partial^n P(y|x)}{\partial x^n} = \frac{\partial^n}{\partial x^n} \int P(y|x)\, dy = 0 \tag{8}$$

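for all n ≥ 1. In view of this result, the only contributing term
to the second order is $-(1/2)\, [\sigma^{(1)}(x)]^2\, (\partial \ln P(y|x)/\partial x)^2$.
Substituting this term into Eq. (4) yields to the second order
the following expression for the MI, I(X; Y):

$$\begin{aligned}
I(X;Y) &= \frac{1}{2} \int dx\, P(x)\, [\sigma^{(1)}(x)]^2 \int dy\, P(y|x) \left[ \frac{\partial \ln P(y|x)}{\partial x} \right]^2 \\
       &= \frac{1}{2} \int dx\, P(x)\, (x - \bar{X})^2\, J(Y|x),
\end{aligned} \tag{9}$$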



where J(Y|x) is the FI defined locally at each value of X as

$$J(Y|x) = \int dy\, P(y|x) \left[ \frac{\partial \ln P(y|x)}{\partial x} \right]^2. \tag{10}$$



This is the first important result of the paper. Its validity is
guaranteed for sufficiently narrow priors for which the higher-order
deviations about the input mean are negligible. Note the
non-local character of this second-order equality (9): The MI
is a squared-deviation-weighted average of the FI, the latter
evaluated locally over the full sample space of X.

For multiple-input, multiple-output (MIMO) channels, the
following multi-parameter analog of the second-order result
(9) is easily derived as well:

$$I(\mathbf{X};\mathbf{Y}) = \frac{1}{2} \int d\mathbf{x}\, P(\mathbf{x}) \sum_{j,k} \delta x_j\, \delta x_k\, J_{jk}(\mathbf{Y}|\mathbf{x}), \tag{11}$$

where $\delta x_j = x_j - \bar{X}_j$ denotes the deviation of the jth
component of the input vector x from its mean value.

It is also possible to extend relations (9) and (11) to the case
of discrete random input parameters by replacing all integrals
over x with discrete sums over values in the sample space of X,
writing instead of Eq. (5)

$$-\ln \mathrm{E}_{X'}\!\left[ \frac{P(y|X')}{P(y|x)} \right] = -\ln\left\{ 1 + \mathrm{E}_{X'}\!\left[ \frac{P(y|X') - P(y|x)}{P(y|x)} \right] \right\}, \tag{12}$$

expanding the logarithm in a power series, and then noting
that up to the second order it may be expressed as

$$\begin{aligned}
&-\mathrm{E}_{X'}\!\left[ \frac{P(y|X') - P(y|x)}{P(y|x)} \right] + \frac{1}{2} \left\{ \mathrm{E}_{X'}\!\left[ \frac{P(y|X') - P(y|x)}{P(y|x)} \right] \right\}^2 \\
&\qquad \le -\mathrm{E}_{X'}\!\left[ \frac{P(y|X') - P(y|x)}{P(y|x)} \right] + \frac{1}{2}\, \mathrm{E}_{X'}\!\left\{ \left[ \frac{P(y|X') - P(y|x)}{P(y|x)} \right]^2 \right\},
\end{aligned} \tag{13}$$

where the inequality follows from a simple application of
the Cauchy-Schwarz inequality. The subscript X' on the
expectation-value symbol indicates that the expectation is
taken relative to X', keeping other variables fixed. An expectation
of the RHS above, first over y, given x, and finally over x,
yields the following upper bound on MI to the second order:



$$I(X;Y) \le \frac{1}{2}\, \mathrm{E}_X \mathrm{E}_{X'} [K(X, X')], \tag{14}$$

where K(X, X'), defined by

$$K(X,X') = \mathrm{E}\left\{ \left[ \frac{P(Y|X') - P(Y|X)}{P(Y|X)} \right]^2 \,\Bigg|\, X, X' \right\}, \tag{15}$$

is the Chapman-Robbins information (CRI) [13]. For a fixed
value of X, the CRI when minimized over all possible values
that X' can take yields, via its reciprocal, the tightest lower
bound on the error in estimating the discrete variable in the
single-test-point optimization subspace. Note that the upper
bound (14) applies to MIMO channels as well.

The results of this section have a simple interpretation: For 
a narrow input PDF, the MI, like the FI and CRI, is a local 
sensitivity based measure of information. The more sensitive 
the channel PDF - and thus the data - to the input, the larger 
all these information measures. The Gaussian linear channel 
illustrates this point well. 

A. The Gaussian Linear Channel 

Consider the Gaussian linear channel in which X and Y
are related through a linear gain parameter a, a linear bias b,
and an additive noise N distributed according to a zero-mean
Gaussian PDF of variance $\sigma_N^2$,

$$Y = aX + b + N, \qquad N \sim \mathcal{N}(0, \sigma_N^2). \tag{16}$$

In this case, the FI of Y, given X = x, is easily computed to
be

$$J(Y|x) = \frac{a^2}{\sigma_N^2}, \tag{17}$$

independent of x. In view of this result, the second-order
equality (9) becomes 1/2 times the power SNR, which is the
ratio of $a^2$ times the X-variance and the noise variance,

$$I(X;Y) = \frac{1}{2}\, \frac{a^2 \sigma_X^2}{\sigma_N^2} = \frac{1}{2}\, \mathrm{SNR}. \tag{18}$$

Note that the Gaussian-channel result (18) depends on the
statistics of X only through its variance. It is also in agreement
with the well known expression for the MI of a Gaussian
channel with a Gaussian input PDF,

$$I(X;Y) = \frac{1}{2} \ln\left( 1 + \frac{a^2 \sigma_X^2}{\sigma_N^2} \right), \tag{19}$$

when the latter is expanded to the lowest order in $\sigma_X^2$.
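The agreement between Eqs. (18) and (19) for narrow priors can be checked numerically. The following is a minimal sketch, not part of the original derivation; the function names and the sample variances are illustrative assumptions.

```python
import math

def mi_second_order(a, var_x, var_n):
    """Second-order MI of Eq. (18), in nats: half the power SNR."""
    return 0.5 * a ** 2 * var_x / var_n

def mi_gaussian_exact(a, var_x, var_n):
    """Exact MI of Eq. (19) for a Gaussian input, in nats."""
    return 0.5 * math.log1p(a ** 2 * var_x / var_n)

# Hypothetical prior variances, from very narrow to broad (a = 1, unit noise).
for var_x in (1e-4, 1e-2, 1.0):
    approx = mi_second_order(1.0, var_x, 1.0)
    exact = mi_gaussian_exact(1.0, var_x, 1.0)
    print(f"var_x={var_x:g}  second-order={approx:.6g}  exact={exact:.6g}")
```

Because ln(1 + s) never exceeds s, the second-order value always lies at or above the exact Gaussian-input value, and the gap closes as the prior narrows.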

For input PDFs that have arbitrary width, a different relation 
between the estimation error and MI can be obtained. The 
precise relation in this case involves the minimum mean- 
squared error of Bayesian estimation and provides a lower 
bound on the MI. I next derive this lower bound. 

III. Bayesian Estimation and Minimum 
Mean-Squared Error 

A good Bayesian estimation is one that reduces the mean 
squared error (MSE) to a value below the variance of the input 
PDF, the so-called prior. The variance of the prior represents 
the maximum MSE incurred by electing to use the mean of 
the prior as the trivial estimator when no information from 
data is available as, e.g., in the limit of a vanishing SNR.

The MSE of a Bayesian estimator, $\hat{X}(Y)$, of X is defined as

$$\mathrm{MSE}_{\hat{X}} = \mathrm{E}\left\{ [\hat{X}(Y) - X]^2 \right\}, \tag{20}$$

where the statistical average is taken over the joint distribution
of X and Y. The estimator $\hat{X}$ that minimizes the MSE is
called the minimum-MSE estimator (MMSEE) [14]. It is easily
shown to be the mean of X, given Y, i.e., its posterior mean,

$$\hat{X}_M(Y) = \mathrm{E}(X|Y) = \int x\, P(x|Y)\, dx. \tag{21}$$

Its mean value is the mean of the prior, $\bar{X}$.

The MSE corresponding to the MMSEE is the minimum
MSE (MMSE) that provides the tightest possible lower bound
for the MSE of any Bayesian estimator of X. Since
$[\hat{X}(Y) - X]^2 = \hat{X}^2(Y) - 2 X \hat{X}(Y) + X^2$, we may express the MSE (20)
for the MMSEE, i.e., the MMSE, as

$$\begin{aligned}
\mathrm{MMSE} &= \mathrm{E}(X^2) - 2\, \mathrm{E}[\mathrm{E}(X|Y)\, \hat{X}_M(Y)] + \mathrm{E}[\hat{X}_M^2(Y)] \\
              &= \mathrm{E}(X^2) - \mathrm{E}[\hat{X}_M^2(Y)] \\
              &= \sigma_X^2 - \sigma_{\hat{X}_M}^2,
\end{aligned} \tag{22}$$

where the last two equalities are obtained by recognizing that
E(X|Y) is the MMSEE, $\hat{X}_M(Y)$, and that X and $\hat{X}_M$ both
have the same expectation. Since variance is always non-negative,
the last equality proves that the MMSE can never
exceed the prior variance.
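The identity (22) is easy to exercise on a small discrete example. The sketch below is illustrative only: the joint table (an equiprobable binary input observed through a channel that flips it with probability 0.1) and the function name are assumptions, not taken from the paper.

```python
def prior_variance_and_mmse(joint, xs):
    """Return (prior variance, MMSE) for a discrete joint table.

    joint[i][j] is P(X = xs[i], Y = j); the table is a hypothetical input.
    The MMSE is computed as E(X^2) - E[Xhat^2(Y)], following Eq. (22).
    """
    px = [sum(row) for row in joint]
    mean_x = sum(x * p for x, p in zip(xs, px))
    mean_x2 = sum(x * x * p for x, p in zip(xs, px))
    est2 = 0.0
    for j in range(len(joint[0])):
        py = sum(row[j] for row in joint)
        if py > 0.0:
            # k is the sum over x of x P(x, y); k / py is the posterior mean.
            k = sum(x * row[j] for x, row in zip(xs, joint))
            est2 += k * k / py
    return mean_x2 - mean_x ** 2, mean_x2 - est2

# Assumed example: binary input, 10% flip probability.
joint = [[0.45, 0.05], [0.05, 0.45]]
var_x, mmse = prior_variance_and_mmse(joint, xs=[0.0, 1.0])
print(var_x, mmse)
```

For this table the posterior means are 0.1 and 0.9, so the MMSE equals the posterior variance 0.1 × 0.9 = 0.09, well below the prior variance of 0.25.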

IV. Relation between Mutual Information and 
MMSE 

The conditional differential entropy, h(X|Y), sometimes
called equivocation, may be expressed as a statistical average
over the output, Y,

$$h(X|Y) = -\mathrm{E}_Y\left[ \int P(x|Y) \ln P(x|Y)\, dx \right], \tag{23}$$

where the argument of the Y-average is the conditional
entropy, given a fixed value of Y. But for a given variance,
$\sigma_{X|Y}^2$, of the PDF P(x|Y), its entropy is bounded above by the
entropy of a Gaussian PDF with the same variance [2], namely
$(1/2) \ln(2\pi e\, \sigma_{X|Y}^2)$. As a result, the conditional differential
entropy (23) is bounded above as follows:

$$\begin{aligned}
h(X|Y) &\le \frac{1}{2}\, \mathrm{E}_Y\left[ \ln\left( 2\pi e\, \sigma_{X|Y}^2 \right) \right] \\
       &\le \frac{1}{2} \ln(2\pi e) + \frac{1}{2} \ln\left[ \int dy\, P(y)\, \sigma_{X|Y}^2 \right],
\end{aligned} \tag{24}$$

where the second inequality results from the concavity of the
logarithm.

To see that the integral on the RHS of the second of the
relations (24) evaluates to the MMSE, we may note that in
view of relation (21)

$$\sigma_{X|Y}^2 = \int \left[ x - \int x' P(x'|Y)\, dx' \right]^2 P(x|Y)\, dx = \int \left[ \hat{X}(Y) - x \right]^2 P(x|Y)\, dx, \tag{25}$$

whose Y-average is simply the MSE for the MMSE estimator,
namely the MMSE. (To simplify notation here and in the
rest of the paper, I have omitted the subscript M from the
MMSE estimator.) Putting results (24) and (25) together, we
arrive at the following upper bound on equivocation:

$$h(X|Y) \le \frac{1}{2} \ln(2\pi e\, \mathrm{MMSE}), \tag{26}$$

and the corresponding lower bound on the MI (2):

$$I(X;Y) \ge h(X) - \frac{1}{2} \ln(2\pi e\, \mathrm{MMSE}). \tag{27}$$

Result (27) is the second major contribution of this paper.
It demonstrates the precise inverse relationship between the
minimum Bayesian estimation error and the minimum statistical
information that can be transmitted by the measurement
channel. For an additive, linear Gaussian channel with a
Gaussian input, both inequalities in Eq. (24) become equalities,
the first because in this case P(X|Y) is Gaussian and the
second because $\sigma_{X|Y}^2$ is independent of Y. Consequently, for
such channel and input, the inequality (27) is obeyed as an
equality. Indeed, since the MMSE for this case is simply
$\sigma_X^2 / (1 + a^2 \sigma_X^2 / \sigma_{Y|X}^2)$, while h(X) is $(1/2) \ln(2\pi e\, \sigma_X^2)$, we have
the well known result, (1/2) ln(1 + SNR), for MI, where
SNR = $a^2 \sigma_X^2 / \sigma_{Y|X}^2$ is the power SNR and a is the linear
gain factor of the Gaussian channel. The derivative equality
obtained in [10],

$$\frac{d}{d\,\mathrm{SNR}}\, I(X;Y) = \frac{1}{2 \sigma_X^2}\, \mathrm{MMSE}, \tag{28}$$

is a simple, immediate consequence of this result specific to
Gaussian channels.

For a non-Gaussian channel, the lower bound (27) on the
MI, I(X; Y), is in general not attainable. I now analyze the 
Poisson channel with a negative-exponential prior to illustrate 
this fact. 

A. The Linear Poisson Channel with a Negative-Exponential 
Prior 

Consider the linear Poisson channel with linear gain (or, 
scaling) factor a and linear bias b, so the conditional mean 
of output Y, given input X, is E(Y\X) = aX + b. The 
conditional Poisson probability distribution (PD) over the 
discrete samples of Y, given X, has the form 

$$P(y|x) = \frac{(ax + b)^y}{y!} \exp[-(ax + b)], \qquad y = 0, 1, 2, \ldots. \tag{29}$$

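If we take the prior PDF to be negative exponential with mean
$\bar{X}$,

$$P(x) = \begin{cases} \dfrac{1}{\bar{X}} \exp(-x/\bar{X}) & \text{for } x \ge 0, \\ 0 & \text{otherwise,} \end{cases} \tag{30}$$

then by Bayes theorem the unconditional Y-PDF takes the
form

$$p(y) = \frac{1}{\bar{X}} \int_0^{\infty} dx\, \frac{(ax + b)^y}{y!} \exp[-(ax + b)] \exp(-x/\bar{X}), \qquad y = 0, 1, 2, \ldots. \tag{31}$$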



By a suitable scaling and shift of the integration variable, this
integral may be expressed in terms of the incomplete Gamma
function,

$$\Gamma(y + 1, u) = \int_u^{\infty} dx\, \exp(-x)\, x^y, \tag{32}$$

as

$$p(y) = \frac{1}{y!}\, \frac{(a\bar{X})^y}{(a\bar{X} + 1)^{y+1}} \exp(b/a\bar{X})\, \Gamma\big( y + 1,\ b(a\bar{X} + 1)/a\bar{X} \big). \tag{33}$$

The following expression for the mean squared MMSEE,
$\mathrm{E}[\hat{X}^2(Y)]$, is a simple consequence of the definition (21) and
the Bayes theorem:

$$\mathrm{E}[\hat{X}^2(Y)] = \sum_{y=0}^{\infty} \frac{K^2(y)}{p(y)}, \tag{34}$$

where K(y) denotes the expression

$$K(y) = \int dx\, x\, P(x)\, p(y|x). \tag{35}$$



For the Poisson channel and negative-exponential prior, K(y)
may be expressed in terms of p(y), since the latter has a similar
expression as (35) with the only difference that the factor x
is missing from the integrand. To see this, we first write x =
(1/a)(ax + b) - b/a in expression (35) and then recognize that
for the Poisson channel PD given by Eq. (29), (ax + b) p(y|x)
equals (y + 1) times p(y + 1|x). This yields the following
useful form for K(y):

$$K(y) = \frac{y + 1}{a}\, p(y + 1) - \frac{b}{a}\, p(y). \tag{36}$$



Substituting this expression into Eq. (34) and noting that

$$\sum_{y=0}^{\infty} (y + 1)\, p(y + 1) = \mathrm{E}(Y) = a\bar{X} + b; \qquad \sum_{y=0}^{\infty} p(y) = 1; \tag{37}$$

and $\mathrm{E}(X^2) = 2\bar{X}^2$ for the NE prior (30), we obtain the
following expression for the MMSE (22):

$$\mathrm{MMSE} = 2\bar{X}^2 + \frac{2b}{a}\, \bar{X} + \frac{b^2}{a^2} - \frac{1}{a^2} \sum_{y=0}^{\infty} \frac{(y + 1)^2\, p^2(y + 1)}{p(y)}. \tag{38}$$

We can now numerically evaluate the MMSE expression (38) in
the general case of arbitrary a and b, but for the case of zero
bias, b = 0, a simple analytical expression can be derived as
we now show.

1) The Case of Zero Bias, b = 0: Expression (33) for p(y)
now greatly simplifies since the incomplete Gamma function
in that expression becomes complete, taking the value y!, and
the sum in expression (38) may now be easily performed
analytically, since

$$\sum_{y=0}^{\infty} \frac{(y + 1)^2\, p^2(y + 1)}{p(y)} = \frac{a\bar{X}}{(a\bar{X} + 1)^2} \sum_{y=0}^{\infty} (y + 1)^2 \left( \frac{a\bar{X}}{a\bar{X} + 1} \right)^{y+1} \tag{39}$$

is related to the sum $\sum_{y=0}^{\infty} \alpha^y = (1 - \alpha)^{-1}$, with $\alpha = a\bar{X}/(a\bar{X} + 1)$,
by two successive applications of the differential operator, $\alpha\, d/d\alpha$. This yields
the following simple expression for the MMSE when b = 0:

$$\mathrm{MMSE} = \frac{\bar{X}^2}{1 + a\bar{X}}. \tag{40}$$

This expression has the desired property of reducing to the
prior variance, $\bar{X}^2$, in the limit of vanishing SNR, $a\bar{X} \to 0$,
and of vanishing in the opposite limit, $a\bar{X} \to \infty$.
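As a sanity check on the zero-bias result, the series in Eq. (38) can be summed numerically and compared with the closed form (40). This is a minimal sketch under stated assumptions: the function names, the truncation at 2000 terms, and the sample values of a and the prior mean are all illustrative choices, and the series terms are evaluated in the log domain purely to avoid floating-point overflow.

```python
import math

def log_p_y(y, gain):
    """ln p(y) for b = 0, where p(y) = gain^y / (gain + 1)^(y + 1).

    `gain` stands for the product of a and the prior mean.
    """
    return y * math.log(gain / (gain + 1.0)) - math.log(gain + 1.0)

def mmse_series(a, xbar, terms=2000):
    """MMSE from the series of Eq. (38) with b = 0, truncated at `terms`."""
    gain = a * xbar
    total = 0.0
    for y in range(terms):
        log_term = (2.0 * math.log(y + 1)
                    + 2.0 * log_p_y(y + 1, gain)
                    - log_p_y(y, gain))
        total += math.exp(log_term)
    return 2.0 * xbar ** 2 - total / a ** 2

def mmse_closed_form(a, xbar):
    """Closed-form MMSE of Eq. (40)."""
    return xbar ** 2 / (1.0 + a * xbar)

# Hypothetical parameters: a = 2, prior mean 1.5.
print(mmse_series(2.0, 1.5), mmse_closed_form(2.0, 1.5))
```

The terms decay geometrically, so the truncation error is negligible for moderate gains; a larger gain would call for more terms.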

The MI may also be evaluated for the Poisson channel and
negative exponential prior, most simply via the second of the
expressions (2). Since ln p(y|x) = y ln(ax + b) - (ax + b) -
ln y!, the conditional mean of -ln p(y|x), given x, is simply

$$-\mathrm{E}[\ln p(Y|x)] = -(ax + b)[\ln(ax + b) - 1] + \mathrm{E}_{Y|x}(\ln Y!). \tag{41}$$

A subsequent average over the prior P(x) then yields the
conditional (discrete) entropy H(Y|X), which when subtracted
from the unconditional output entropy H(Y) = -E[ln p(Y)]
produces the following exact expression for the MI:

$$I(X;Y) = \int dx\, P(x)\, (ax + b)[\ln(ax + b) - 1] - \sum_{y=0}^{\infty} p(y) \ln[p(y)\, y!]. \tag{42}$$

This too can be evaluated numerically.
Fig. 1. MI vs. the normalized linear gain parameter, $a\bar{X}$, for three different
values of b, as indicated. The solid curves refer to the exact result (42), while
the corresponding dashed curves refer to the lower bound (27).

Fig. 2. MI vs. the linear bias parameter, b, for three different values of gain
$a\bar{X}$, as indicated. The solid curves refer to the exact result (42), while the
corresponding dashed curves refer to the lower bound (27).



The differential entropy of the negative exponential prior
takes a simple analytical form, since $-\ln P(x) = \ln \bar{X} + x/\bar{X}$,
whose mean, the differential entropy of X, is simply $1 + \ln \bar{X}$:

$$h(X) = 1 + \ln \bar{X}. \tag{43}$$

Use of this expression and the MMSE (38) yields the lower
bound (27) on the MI. I now compare this lower bound
numerically with the exact value given by the expression (42).

In Fig. 1 I display, as a function of the normalized linear
gain parameter $a\bar{X}$, the exact expression (42) (solid curves)
along with the corresponding lower bound (27) (dashed
curves) for three different values of b, namely 0, 50, and
100. The lower bound becomes tighter as the gain parameter
increases in value, but typically it fails to provide a useful,
nontrivial lower bound below a certain threshold value of the
gain. Indeed, as the exact expression for the lower bound in
the case b = 0 obtained from Eqs. (43), (40), and (27), namely

$$\begin{aligned}
I(X;Y) &\ge h(X) - \frac{1}{2} \ln(2\pi e\, \mathrm{MMSE}) \\
       &= \frac{1}{2} \left[ 1 - \ln\left( \frac{2\pi\, \mathrm{MMSE}}{\bar{X}^2} \right) \right],
\end{aligned} \tag{44}$$

shows, the lower bound drops below the trivial lower bound
of 0 for $a\bar{X}$ below $2\pi/e - 1 \approx 1.31$. A similar but higher
threshold below which the lower bound (27) ceases to be
nontrivial is obtained when b is non-zero. However, as b
increases this lower bound becomes increasingly tight and
thus more useful at sufficiently large values of the normalized
gain, $a\bar{X}$.
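The zero-bias threshold quoted above follows directly from Eqs. (40) and (44), and a few lines of code make the sign change explicit. The sketch below is illustrative; the function name and the probe values of the normalized gain are assumptions.

```python
import math

def lower_bound_zero_bias(gain):
    """Lower bound of Eq. (44) in nats, using MMSE from Eq. (40).

    `gain` stands for the normalized gain, the product of a and the prior mean,
    so that the ratio of the MMSE to the squared prior mean is 1 / (1 + gain).
    """
    return 0.5 * (1.0 - math.log(2.0 * math.pi / (1.0 + gain)))

# Gain at which the bound crosses zero.
threshold = 2.0 * math.pi / math.e - 1.0

for gain in (0.5, threshold, 2.0, 10.0):
    print(f"gain={gain:.4f}  bound={lower_bound_zero_bias(gain):+.4f} nats")
```

Below the threshold the bound is negative and therefore weaker than the trivial statement that the MI is non-negative.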

In Fig. 2, I plot the exact values and the corresponding
lower-bound values for the MI as a function of the linear bias
parameter, b, for three different values of the gain parameter,
$a\bar{X}$. As b increases, the MI decreases as expected since the
sensitivity of the data to the input variable X is reduced. Raising
the linear gain raises the MI, as expected, for each b value,
as the previous figure shows. Again, it is clear that the lower
bound (27) is a useful one for sufficiently large values of $a\bar{X}$
and b.



V. Generalization to Multiple-Input, 
Multiple-Output Channels 

When MIMO channels are involved, we may organize the
input and output variables into two different column vectors,
say $\mathbf{X} = (X_1, \ldots, X_N)^T$ and $\mathbf{Y} = (Y_1, \ldots, Y_M)^T$, where T
denotes a matrix transpose. The N-parameter analog of the
upper bound (24) is simply

$$h(\mathbf{X}|\mathbf{Y}) \le \frac{1}{2}\, \mathrm{E}_{\mathbf{Y}}\left\{ \ln\left[ (2\pi e)^N\, |\mathbf{C}_{X|Y}| \right] \right\}, \tag{45}$$

where $|\mathbf{C}_{X|Y}|$ denotes the determinant of the positive semi-definite
covariance matrix of X, given Y. The determinant of
such a matrix is a product of its N non-negative eigenvalues, or
simply the Nth power of their geometric mean. Since the latter
cannot exceed the arithmetic mean of these eigenvalues, which
is 1/N times the trace of the matrix, and since the logarithm
is a concave function, we have the following inequalities for
h(X|Y):



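$$\begin{aligned}
h(\mathbf{X}|\mathbf{Y}) &\le \frac{N}{2} \ln(2\pi e) + \frac{N}{2}\, \mathrm{E}_{\mathbf{Y}}\left[ \ln\left( \frac{1}{N}\, \mathrm{Tr}\, \mathbf{C}_{X|Y} \right) \right] \\
&\le \frac{N}{2} \ln(2\pi e/N) + \frac{N}{2} \ln \mathrm{E}_{\mathbf{Y}}\left( \mathrm{Tr}\, \mathbf{C}_{X|Y} \right) \\
&= \frac{N}{2} \ln(2\pi e\, \mathrm{MMSE}),
\end{aligned} \tag{46}$$

where MMSE here is the average minimum MSE of a
component-wise estimation of X,

$$\begin{aligned}
\mathrm{MMSE} &= \frac{1}{N}\, \mathrm{E}_{\mathbf{Y}}\, \mathrm{E}\left\{ [\mathbf{X} - \hat{\mathbf{X}}(\mathbf{Y})]^T [\mathbf{X} - \hat{\mathbf{X}}(\mathbf{Y})] \,\big|\, \mathbf{Y} \right\} \\
&= \frac{1}{N}\, \mathrm{E}_{\mathbf{Y}}\left[ \mathrm{Tr}\, \mathbf{C}_{X|Y} \right],
\end{aligned} \tag{47}$$

involving the MMSEE $\hat{\mathbf{X}}(\mathbf{Y})$ for the MIMO problem,

$$\hat{\mathbf{X}}(\mathbf{Y}) = \int \mathbf{X}\, P(\mathbf{X}|\mathbf{Y})\, d\mathbf{X}. \tag{48}$$

Correspondingly, the MI is lower bounded by

$$I(\mathbf{X};\mathbf{Y}) \ge h(\mathbf{X}) - \frac{N}{2} \ln(2\pi e\, \mathrm{MMSE}). \tag{49}$$

Note that in the MIMO case each component of the MMSEE
minimizes the MSE of the corresponding input parameter, the
one it estimates. As such, the MMSEE vector (48) as a whole
also minimizes the average MSE per component of the input
vector.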
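The step from the determinant to the trace in the MIMO bound rests on the arithmetic-geometric mean inequality for the eigenvalues of the conditional covariance matrix. A minimal sketch for N = 2 follows; the covariance matrix used is a hypothetical example and the function name is an assumption.

```python
import math

def logdet_and_trace_bound(cov):
    """Return (ln det C, N ln(Tr C / N)) for a 2x2 positive-definite matrix.

    `cov` is a nested list [[c00, c01], [c10, c11]]; N = 2 here.
    """
    n = 2
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    trace = cov[0][0] + cov[1][1]
    return math.log(det), n * math.log(trace / n)

# Assumed example covariance with correlated components.
log_det, trace_bound = logdet_and_trace_bound([[2.0, 0.5], [0.5, 1.0]])
print(log_det, trace_bound)
```

The first number never exceeds the second, with equality only when the matrix is a multiple of the identity, which is why the trace-based bound is generally looser than the determinant-based one.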



VI. Additional Properties of MMSE and Their 
Correspondence with Information 

I now establish two additional important properties of the 
MMSE not previously reported in the literature but which 
help strengthen the correspondences with information I have 
already discussed via relations (9), (11), (27), and (49). The
first of these concerns the behavior of the MMSE as additional 
measurements are made. It is well known [2], [3] that both MI
and FI exhibit an additive property, namely 

$$\begin{aligned}
I(X; Y, Z) &= I(X; Y) + I(X; Z|Y) \ge I(X; Y), \\
J(Y, Z; X) &= J(Y; X) + J(Z|Y; X) \ge J(Y; X),
\end{aligned} \tag{50}$$

which represents the fact that in general an additional measurement
only increases information. The conditional information,
either I(X; Z|Y) or J(Z|Y; X), is a direct measure of the
capacity of the measurement Z to improve information about
X, given that the measurement Y has already been made.
Since the estimation variance is lower bounded by the inverse
of the FI,¹ the two relations (50) represent a useful inverse
relationship between MI and estimation error.

But this fundamental relationship is at best a local one 
since, as I have argued before, the FI and its inverse, the 
Cramer-Rao lower bound on estimator variance, are local 
measures of information and estimation fidelity. I now show 
that MMSE exhibits a similar behavior, which will serve 
to accord a general global character to this local inverse 
relationship between information and error. 



A. MMSE Cannot Increase with Measurement 

Let us consider two measurements, Y and Z, of the input
parameter X. The joint MMSE estimator, $\hat{X}(Y,Z) = \mathrm{E}(X|Y,Z)$,
has the following mean squared value:

$$\begin{aligned}
\mathrm{E}[\hat{X}^2(Y,Z)] &= \mathrm{E}\left[ \iint dx\, dx'\, x\, x'\, P(x|Y,Z)\, P(x'|Y,Z) \right] \\
&= \iint dx\, dx'\, x\, x' \iint dy\, dz\, \frac{P(x,y,z)\, P(x',y,z)}{P(y,z)},
\end{aligned} \tag{51}$$

where the Bayes theorem was used to express the posterior
probabilities in terms of the joint PDFs.
In terms of the integral,

$$K(z,y) \equiv \int dx\, x\, \frac{P(x,y,z)}{\sqrt{P(z|y)}}, \tag{52}$$

we may write the mean squared value of the joint MMSE
estimator (51) as



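$$\begin{aligned}
\mathrm{E}[\hat{X}^2(Y,Z)] &= \int \frac{dy}{P(y)} \int dz\, K^2(z,y) \cdot \int dz\, P(z|y) \\
&\ge \int \frac{dy}{P(y)} \left[ \int dz\, K(z,y) \sqrt{P(z|y)} \right]^2 \\
&= \int \frac{dy}{P(y)} \iint dx\, dx'\, x\, x'\, P(x,y)\, P(x',y) \\
&= \iint dx\, dx'\, x\, x' \int dy\, \frac{P(x,y)\, P(x',y)}{P(y)} \\
&= \mathrm{E}_Y\left\{ [\mathrm{E}(X|Y)]^2 \right\}.
\end{aligned} \tag{53}$$

¹ When multiple inputs are involved, the additivity and inequality relations
for the FI as well as its inverse must be interpreted in the matrix sense.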



The first equality follows from substituting the Bayes relation,
P(y, z) = P(y) P(z|y), and definition (52) into expression
(51) and from the unit normalization of the PDF P(z|y); the
second line follows from the Cauchy-Schwarz inequality; the
third line from a substitution of the definition (52) and the
identity, $\int dz\, P(x,y,z) = P(x,y)$; and the fourth line from
an interchange of the order of the integrals.

Since the last expression in inequality (53) is simply the
mean squared value of the MMSE estimator, $\hat{X}(Y)$, relative
to the measurement Y alone, we have arrived at the desired
result,

$$\begin{aligned}
\mathrm{MMSE}(Y,Z) &= \mathrm{E}(X^2) - \mathrm{E}[\hat{X}^2(Y,Z)] \\
&\le \mathrm{E}(X^2) - \mathrm{E}[\hat{X}^2(Y)] = \mathrm{MMSE}(Y),
\end{aligned} \tag{54}$$

where we have used the fact that $\mathrm{E}[\hat{X}(Y)] = \mathrm{E}[\hat{X}(Y,Z)] = \mathrm{E}(X)$
to express the MSE, $\mathrm{E}\{[\hat{X}(Y) - X]^2\}$, as the difference
of mean squared values of the prior and the estimator. Note
that for the inequality (54) to hold, the two measurements are
not required to be conditionally independent, given the input
X.
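Inequality (54) can be illustrated on a small discrete example. The sketch below is only an illustration: it assumes an equiprobable binary input observed through two binary channels with flip probabilities 0.1 and 0.3, which happen to be conditionally independent even though the inequality does not require that. All names are hypothetical.

```python
from itertools import product

def mmse(joint):
    """MMSE of X from a dict {(x, obs): prob}, where obs is any hashable
    observation; computed as E(X^2) minus the mean squared posterior mean."""
    mean_x2 = sum(p * x * x for (x, _), p in joint.items())
    num, den = {}, {}
    for (x, obs), p in joint.items():
        num[obs] = num.get(obs, 0.0) + x * p   # sum over x of x P(x, obs)
        den[obs] = den.get(obs, 0.0) + p       # P(obs)
    return mean_x2 - sum(num[o] ** 2 / den[o] for o in den if den[o] > 0.0)

eps_y, eps_z = 0.1, 0.3   # assumed flip probabilities of the two channels
joint_yz = {}
for x, y, z in product((0, 1), repeat=3):
    p_y = eps_y if y != x else 1.0 - eps_y
    p_z = eps_z if z != x else 1.0 - eps_z
    joint_yz[(x, (y, z))] = 0.5 * p_y * p_z

# Marginalize over z to obtain the single-measurement joint distribution.
joint_y = {}
for (x, (y, _z)), p in joint_yz.items():
    joint_y[(x, y)] = joint_y.get((x, y), 0.0) + p

print(mmse(joint_y), mmse(joint_yz))
```

The second printed value is the smaller of the two, consistent with the claim that an added measurement cannot increase the MMSE.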

The second property of MMSE relates to the case of 
multiple input parameters and how the error in the estimation 
of any one parameter is affected by the presence of the others. 
But to fully appreciate this property, we must place it in the 
context of statistical information processing to which I now 
turn. 

VII. Role of Nuisance in Statistical Information 
Processing 

It is well known that the fidelity of estimation of a pa- 
rameter, defined here as the smallness of the lower bound on 
the statistical variance of the estimator, decreases when other 
parameters are added to the problem. These added parameters, 
when not of interest, are known as nuisance parameters, 
and serve to reduce the fidelity, i.e., increase the variance, 
of estimation of the parameter of interest. The essence of 
this phenomenon is captured well by the FI matrix and its 
inverse whose diagonal elements provide the Cramer-Rao 
lower bounds on the variances of an unbiased estimation of 
the parameters [3], [15].

A similar result must hold in the context of statistical 
information theory as well. It must be possible to show that 
when the output variables Y depend on two input parameters, 
X and U, that are distributed independently, then the MI 
between X and Y cannot be larger than the MI obtained by
computing the MI between X and Y for a fixed value of U 
first and then averaging it over the statistical distribution of 
the possible values of U. The latter, averaged MI represents 
the information about X successfully transmitted through the 
information channel when U is held fixed in each instance, so 
the statistical dispersion of U does not corrupt the data relative 
to their capacity to carry information about X. I now prove 
this result. 

Let us define $I^{(+)}(X;Y)$ as the MI in the case U serves as
a nuisance parameter, namely as

$$\begin{aligned}
I^{(+)}(X;Y) &= I(X;Y) \\
&= H(X) - H(X|Y),
\end{aligned} \tag{55}$$

where H(X), H(X|Y) are defined as before. This expression
for MI may also be written as the following average over all
three variables:

$$\begin{aligned}
I^{(+)}(X;Y) &= \mathrm{E}\left[ \ln \frac{P(X,Y)}{P(X)\, P(Y)} \right] \\
&= \iiint P(x,y,u) \ln\left[ \frac{P(x,y)}{P(x)\, P(y)} \right] dx\, dy\, du,
\end{aligned} \tag{56}$$

as the integral over u only affects the joint density P(x, y, u),
reducing it to the marginal, P(x, y).

In the absence of nuisance, which is indicated by a (-)
superscript, the MI is the following U-averaged conditional
MI:

$$I^{(-)}(X;Y) = \iiint P(x,y,u) \ln\left[ \frac{P(x,y|u)}{P(x|u)\, P(y|u)} \right] dx\, dy\, du. \tag{57}$$



Note that $I^{(-)}(X;Y)$ is the same as the more familiar conditional
MI, I(X;Y|U), so the difference between $I^{(-)}(X;Y)$
and $I^{(+)}(X;Y)$ is equivalently that between I(X;Y) and its
conditional version, I(X;Y|U), which, as is well known [2],
can be of either sign.

In view of Jensen's inequality applied to the logarithm, the
difference between the two MIs, (56) and (57), has a lower
bound,

$$\begin{aligned}
I^{(-)}(X;Y) - I^{(+)}(X;Y) &= -\iiint P(x,y,u) \ln\left[ \frac{P(x,y)\, P(x|u)\, P(y|u)}{P(x,y|u)\, P(x)\, P(y)} \right] dx\, dy\, du \\
&\ge -\ln \iiint P(x,y,u)\, \frac{P(x,y)\, P(x|u)\, P(y|u)}{P(x,y|u)\, P(x)\, P(y)}\, dx\, dy\, du \\
&= -\ln \iiint \frac{P(u)\, P(x,y)\, P(x|u)\, P(y|u)}{P(x)\, P(y)}\, dx\, dy\, du \\
&= -\ln \iiint \frac{P(x|u)\, P(y,u)\, P(x,y)}{P(x)\, P(y)}\, dx\, dy\, du.
\end{aligned} \tag{58}$$

In obtaining the last two equalities above, I have used
the Bayes rule twice, first via the identity P(x, y, u) =
P(x, y|u) P(u), and then via the identity P(y|u) P(u) =
P(y, u).

When the variables X and U are statistically independent,
P(x|u) = P(x), the above inequality simplifies greatly to the
form

$$\begin{aligned}
I^{(-)}(X;Y) - I^{(+)}(X;Y) &\ge -\ln \iiint \frac{P(y,u)\, P(x,y)}{P(y)}\, dx\, dy\, du \\
&= -\ln \iint P(x,y)\, dx\, dy = -\ln 1 = 0,
\end{aligned} \tag{59}$$

where I used the fact that $\int P(y,u)\, du = P(y)$ and the
normalization of the joint PDF P(x, y). This proves our
assertion. Note that since I have made no explicit use of the
dimensionality of the input and output spaces in this proof,
the result is valid for an arbitrary MIMO channel.
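A discrete toy model shows how large the gap between the two MIs can be. In the sketch below, which is an illustrative assumption rather than an example from the paper, the output is the exclusive-or of two independent fair bits X and U, flipped with probability 0.1. Averaging over the nuisance U then destroys all information about X, while fixing U preserves most of it.

```python
import math
from itertools import product

def mutual_info(pxy):
    """I(X;Y) in nats from a dict {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in pxy.items() if p > 0.0)

def nuisance_pair(pxyu):
    """Return (MI with nuisance averaged out, U-averaged conditional MI)
    from a dict {(x, y, u): prob}; these play the roles of Eqs. (56), (57)."""
    pxy, pu = {}, {}
    for (x, y, u), p in pxyu.items():
        pxy[(x, y)] = pxy.get((x, y), 0.0) + p
        pu[u] = pu.get(u, 0.0) + p
    i_plus = mutual_info(pxy)
    i_minus = 0.0
    for u0, w in pu.items():
        cond = {(x, y): p / w for (x, y, u), p in pxyu.items() if u == u0}
        i_minus += w * mutual_info(cond)
    return i_plus, i_minus

# Assumed model: Y = X xor U, flipped with probability 0.1; X, U fair bits.
pxyu = {(x, y, u): 0.25 * (0.9 if y == (x ^ u) else 0.1)
        for x, y, u in product((0, 1), repeat=3)}
i_plus, i_minus = nuisance_pair(pxyu)
print(i_plus, i_minus)
```

Here the first value vanishes up to rounding, while the second equals ln 2 minus the binary entropy of the flip probability, in line with inequality (59).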

A. Analogous Result from Statistical Estimation Theory 

A correspondence may be drawn with analogous results 
from statistical estimation theory using FI. One can consider 
two different estimation problems involving nuisance, one in 
which the nuisance is also estimated and another in which it 
is not, which must be treated separately. 

a) Estimation of Both Input and Nuisance Parameters:
The FI matrix relative to X and U, when both are unknown,
namely $\mathbf{J}$, may be expressed in terms of the FI matrix relative
to X, when U is known, namely $\mathbf{J}_{XX}$, in the following block
form:

$$\mathbf{J} = \begin{pmatrix} \mathbf{J}_{XX} & \mathbf{J}_{XU} \\ \mathbf{J}_{UX} & \mathbf{J}_{UU} \end{pmatrix}, \tag{60}$$

where the matrix block $\mathbf{J}_{UU}$ refers to the FI matrix relative
to U alone, when X is known, and the off-diagonal blocks
$\mathbf{J}_{XU}$ and $\mathbf{J}_{UX}$, which are transposes of each other, refer to
the cross-sensitivity of the data likelihood relative to X and
U. The presence of the cross-sensitivity matrices, $\mathbf{J}_{XU}$ and
$\mathbf{J}_{UX}$, tends to increase the CRBs since, as one may easily
show [15], the XX block of $\mathbf{J}^{-1}$, for example, has the form

$$(\mathbf{J}^{-1})_{XX} = \left( \mathbf{J}_{XX} - \mathbf{J}_{XU}\, \mathbf{J}_{UU}^{-1}\, \mathbf{J}_{UX} \right)^{-1} \ge (\mathbf{J}_{XX})^{-1}, \tag{61}$$

the matrix inequality following from the fact that
$\mathbf{J}_{XU}\, \mathbf{J}_{UU}^{-1}\, \mathbf{J}_{UX}$ is a positive semidefinite matrix. For the Bayesian
case of priors on X and U, assumed for the moment to be
uncorrelated, the FI matrices relative to these priors on X and
U must be added to the blocks $\mathbf{J}_{XX}$ and $\mathbf{J}_{UU}$, respectively,
in expression (60). Adding these prior-information-based FI
submatrices has, as expected, the opposite effect: It decreases
the CRBs on X and U, thus improving the fidelity of
estimation.
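For a single input parameter and a single nuisance parameter, the block result (61) reduces to scalar arithmetic that is easy to verify. The following sketch is illustrative only; the function name and the FI values are hypothetical.

```python
def crb_with_nuisance(j_xx, j_xu, j_uu):
    """XX element of the inverse of a 2x2 FI matrix, via the Schur
    complement of Eq. (61); all three arguments are scalars here."""
    if j_uu <= 0.0:
        raise ValueError("J_UU must be positive")
    schur = j_xx - j_xu * j_xu / j_uu
    if schur <= 0.0:
        raise ValueError("FI matrix must be positive definite")
    return 1.0 / schur

# Assumed FI values: J_XX = 4, J_XU = 1, J_UU = 2.
crb_unknown_u = crb_with_nuisance(4.0, 1.0, 2.0)   # nuisance also estimated
crb_known_u = 1.0 / 4.0                            # nuisance known
print(crb_unknown_u, crb_known_u)
```

With these numbers the bound grows from 1/4 to 2/7 when the nuisance parameter must also be estimated, and the two coincide only when the cross-sensitivity term vanishes.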

b) Estimation of Input without Estimating Nuisance: In 
this case, we must integrate over the statistical distribution of 
nuisance to obtain the needed PDFs from their nuisance-free 
counterparts. We have, in particular, 

$$P(y|x) = \int P(y|x,u)\,P(u|x)\,du, \tag{62}$$

where the input, output, and nuisance parameters have been 
organized into three respective column vectors, X, Y, and U. 
If we take the nuisance and input variables to be statistically 
uncorrelated, P(u|x) = P(u), then we may simply take the gradient 
of Eq. (62) with respect to x, 

$$\nabla_x P(y|x) = \int \nabla_x P(y|x,u)\,P(u)\,du. \tag{63}$$

The inner product of this gradient vector with an arbitrary 
vector, $\lambda$, of the same length generates a scalar quantity, 

$$\lambda^T \nabla_x P(y|x) = \int \lambda^T \nabla_x P(y|x,u)\,P(u)\,du. \tag{64}$$

Upon writing the integrand in Eq. (64) as the bilinear product 

$$\left[P^{1/2}(u)\,P^{-1/2}(y|x,u)\,\lambda^T \nabla_x P(y|x,u)\right] \times \left[P^{1/2}(u)\,P^{1/2}(y|x,u)\right], \tag{65}$$

squaring both sides of that equation, and then using the 
Cauchy-Schwarz inequality, we arrive at the inequality 

$$\left[\lambda^T \nabla_x P(y|x)\right]^2 \le \int du\,P(u)\,\frac{\left[\lambda^T \nabla_x P(y|x,u)\right]^2}{P(y|x,u)} \int P(u)\,P(y|x,u)\,du. \tag{66}$$

Since the last u-integral above evaluates simply to P(y|x) 
according to Bayes rule, by dividing both sides by P(y|x), 
integrating over dy, and finally averaging over X, we obtain 
the desired inequality, 

$$\lambda^T J^{(+)}(Y|X)\,\lambda \le \lambda^T \int du\,P(u)\,J_u(Y|X)\,\lambda = \lambda^T J^{(-)}(Y|X)\,\lambda, \tag{67}$$

where the FI matrices in the presence and absence of nuisance 
are defined as 

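$$\begin{aligned}
J^{(+)}(Y|X) &\stackrel{\rm def}{=} \iint dx\,dy\,P(x)\,P(y|x)\,\frac{1}{P^2(y|x)}\,\nabla_x P(y|x)\,\nabla_x^T P(y|x), \\
J^{(-)}(Y|X) &\stackrel{\rm def}{=} \int du\,P(u) \iint dx\,dy\,P(x)\,P(y|x,u)\,\frac{1}{P^2(y|x,u)}\,\nabla_x P(y|x,u)\,\nabla_x^T P(y|x,u).
\end{aligned} \tag{68}$$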



Note that for statistically mutually independent input and 
nuisance parameters, the prior-based FI for the input parameters 
is the same whether the nuisance parameters are present or 
absent. It then follows from the non-negative-definiteness of 
the difference of the data-based FIs, $J^{(-)}(Y|X) - J^{(+)}(Y|X)$, 
implied by relation (67), that the corresponding difference 
between the sums of data-based and prior-based FIs is also 
non-negative-definite. This result embodies the fact that, in 
general, nuisance parameters, even when they are not estimated, 
if statistically independent of the input parameters of interest, 
degrade the fidelity with which the latter can be estimated 

na. 
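
As a hedged numerical illustration of inequality (67), consider a scalar linear Gaussian model Y = aX + bU + N with U independent of X, so that P(y|x) and P(y|x,u) are both Gaussian. The sketch below, in Python with NumPy, evaluates the scalar FI integral by central differences and a Riemann sum. All parameter values, and the helper names `gaussian_pdf` and `fisher_information`, are assumptions made for the illustration. In this model the FI does not depend on x or u, so the averages over the priors in Eq. (68) are trivial.

```python
import numpy as np

def gaussian_pdf(y, mean, var):
    return np.exp(-(y - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def fisher_information(pdf, x, y_grid, dx=1e-4):
    """Scalar FI, integral of [d/dx P(y|x)]^2 / P(y|x) over y, on a uniform grid."""
    dy = y_grid[1] - y_grid[0]
    p = pdf(y_grid, x)
    dp_dx = (pdf(y_grid, x + dx) - pdf(y_grid, x - dx)) / (2.0 * dx)
    return float(np.sum(dp_dx ** 2 / p) * dy)

# Hypothetical parameter values
a, b = 1.0, 0.5
var_n, var_u, u_mean = 1.0, 4.0, 0.3
x0 = 0.7
y_grid = np.linspace(-20.0, 20.0, 20001)

# Nuisance present (integrated out): Y|x ~ N(a x + b u_mean, b^2 var_u + var_n)
fi_plus = fisher_information(
    lambda y, x: gaussian_pdf(y, a * x + b * u_mean, b ** 2 * var_u + var_n),
    x0, y_grid)

# Nuisance absent (U held fixed, here at u_mean): Y|x,u ~ N(a x + b u, var_n)
fi_minus = fisher_information(
    lambda y, x: gaussian_pdf(y, a * x + b * u_mean, var_n),
    x0, y_grid)

print(fi_plus, a ** 2 / (b ** 2 * var_u + var_n))  # numerical vs. closed form
print(fi_minus, a ** 2 / var_n)
```

For these values the closed forms are a²/(b²σ_U² + σ_N²) = 0.5 and a²/σ_N² = 1, so the data-based FI is halved when the nuisance is present but not estimated.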

B. Statistical Correlation of X and U Priors 

For the more general case when X and U are statistically 
correlated, U may indeed carry information about X through 
their correlation, in which case the RHS of the inequality (58) 
may be negative, allowing $I^{(+)}$ to exceed $I^{(-)}$. As I noted 
before, in this general case the inequality (58) can be of either 
sign. 

The corresponding result from statistical estimation theory 
is based on the fact that any information that U has about 
X through its correlations with it yields an additional FI 
submatrix to be added to the $J_{XX}$ block in Eq. (60). 
This submatrix represents information that U carries about 
X through the first-order sensitivity of P(u|x) on x. Unlike 
the coupling of U to the data alone, such additional prior 
information can reduce the CRBs on estimating X. When the 
nuisance parameters are not estimated, but are correlated with 
the input parameters of interest, the basic relation (63) used to 
obtain the desired inequality (67) is itself not valid. Also, the 
prior-based FIs are not necessarily the same with and without 
nuisance. As a result, the data-based FI or prior-based FI or 
both may contain more information about the parameters of 
interest in the presence of nuisance than in its absence. 



C. Two Illustrative Examples 

As a first example, let us consider a Gaussian additive 
channel with additive noise N, 



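$$Y = aX + bU + N, \tag{69}$$

where X, U, and N are all independently normally distributed 
as follows: 

$$X \sim \mathcal{N}(\bar X, \sigma_X^2), \quad U \sim \mathcal{N}(\bar U, \sigma_U^2), \quad N \sim \mathcal{N}(0, \sigma_N^2). \tag{70}$$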

Thus the marginal PDF for Y as well as its various conditional 
PDFs are all Gaussian too, 



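$$\begin{aligned}
Y &\sim \mathcal{N}(a\bar X + b\bar U,\; a^2\sigma_X^2 + b^2\sigma_U^2 + \sigma_N^2), \\
Y|X &\sim \mathcal{N}(aX + b\bar U,\; b^2\sigma_U^2 + \sigma_N^2), \\
Y|U &\sim \mathcal{N}(a\bar X + bU,\; a^2\sigma_X^2 + \sigma_N^2), \\
Y|X,U &\sim \mathcal{N}(aX + bU,\; \sigma_N^2).
\end{aligned} \tag{71}$$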



In view of the variances given in relations (71), we may 
easily write down the MIs of interest here using the well 
known expression for the differential entropy for a Gaussian 
additive channel [ID, 



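$$\begin{aligned}
I^{(+)}(X;Y) &= I(X;Y) = \frac{1}{2} \ln\left(1 + \frac{a^2\sigma_X^2}{b^2\sigma_U^2 + \sigma_N^2}\right), \\
I^{(-)}(X;Y) &= I(X;Y|U) = \frac{1}{2} \ln\left(1 + \frac{a^2\sigma_X^2}{\sigma_N^2}\right).
\end{aligned} \tag{72}$$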



Since $b^2\sigma_U^2 > 0$, it follows that $I^{(-)}(X;Y) > I^{(+)}(X;Y)$. 
For uncorrelated X and U, the latter serves as a nuisance 
relative to information about the former in the sense that the 
terms bU and N in the model (69) simply combine to yield 
an increased noise variance, $b^2\sigma_U^2 + \sigma_N^2$, on the determination 
of X from Y. But when the nuisance is removed by holding 
U fixed in each measurement, the noise variance is lower at 
$\sigma_N^2$, leading to increased information about X. 
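
The two MIs of Eq. (72) are simple enough to evaluate directly. The following is a small Python sketch with hypothetical parameter values. The function name `mi_example1` is illustrative, and the MIs are in nats because natural logarithms are used.

```python
import math

def mi_example1(a, b, var_x, var_u, var_n):
    """Return (I_plus, I_minus) of Eq. (72) for Y = aX + bU + N with independent X, U, N."""
    # Nuisance present: bU adds to the effective noise variance
    i_plus = 0.5 * math.log(1.0 + a ** 2 * var_x / (b ** 2 * var_u + var_n))
    # Nuisance absent (U held fixed): only N acts as noise
    i_minus = 0.5 * math.log(1.0 + a ** 2 * var_x / var_n)
    return i_plus, i_minus

# Hypothetical values: a = b = 1, var_x = 3, var_u = 2, var_n = 1
i_plus, i_minus = mi_example1(1.0, 1.0, 3.0, 2.0, 1.0)
print(i_plus, i_minus)  # (1/2) ln 2 and ln 2
```

With these values the nuisance triples the effective noise variance and halves the MI, from ln 2 to (1/2) ln 2.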

As my second example, let us modify the Gaussian channel 
represented by Eqs. (69) and (70) simply to include a 
correlation between X and U, so only the first of the relations in 
Eq. (70) is changed to the conditional relation 

$$X|U \sim \mathcal{N}(\alpha U, \sigma_{X|U}^2), \tag{73}$$

while the remaining relations are unchanged. In the limit that 
$\alpha \to 0$, the variables X and U become uncorrelated as in 
the previous example. Thus the largeness of $|\alpha|\sigma_U$ in relation 
to $\sigma_{X|U}$ may be regarded as the strength of the correlation 
between X and U. 

In view of the relation (73) and the fact that $U \sim \mathcal{N}(\bar U, \sigma_U^2)$, 
the marginal PDF P(x) is also Gaussian. The conditional PDF, 
P(y|u), for Y, given U, may be computed by integrating 
P(y|x, u)P(x|u) over x. The marginal PDF P(y) is then 
obtained by integrating P(y|u)P(u) over u. Using standard 
analysis involving Gaussian integrals, we may derive the 
following marginal and conditional PDFs: 

$$\begin{aligned}
X &\sim \mathcal{N}(\alpha\bar U,\; \sigma_X^2 = \sigma_{X|U}^2 + \alpha^2\sigma_U^2); \\
Y|U &\sim \mathcal{N}((a\alpha + b)U,\; \sigma_N^2 + a^2\sigma_{X|U}^2); \\
Y &\sim \mathcal{N}((a\alpha + b)\bar U,\; \sigma_Y^2),
\end{aligned} \tag{74}$$

where the Y-variance may be expressed as 

$$\sigma_Y^2 = \sigma_N^2 + a^2\sigma_{X|U}^2 + (a\alpha + b)^2\sigma_U^2. \tag{75}$$



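Having evaluated P(x), we may now evaluate P(u|x), via 
Bayes rule, as the ratio P(u)P(x|u)/P(x), 

$$\begin{aligned}
P(u|x) &= \frac{\sigma_X}{\sqrt{2\pi}\,\sigma_U\,\sigma_{X|U}} \exp\left[\frac{(x - \alpha\bar U)^2}{2(\sigma_{X|U}^2 + \alpha^2\sigma_U^2)} - \frac{(u - \bar U)^2}{2\sigma_U^2} - \frac{(x - \alpha u)^2}{2\sigma_{X|U}^2}\right] \\
&= \frac{1}{\sqrt{2\pi\sigma_{U|x}^2}} \exp\left[-\frac{(u - \bar U_{|x})^2}{2\sigma_{U|x}^2}\right],
\end{aligned} \tag{76}$$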



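where the conditional mean, $\bar U_{|x}$, and variance, $\sigma_{U|x}^2$, are 
given by the expressions 

$$\bar U_{|x} = \sigma_{U|x}^2\left(\frac{\bar U}{\sigma_U^2} + \frac{\alpha x}{\sigma_{X|U}^2}\right), \qquad \frac{1}{\sigma_{U|x}^2} = \frac{1}{\sigma_U^2} + \frac{\alpha^2}{\sigma_{X|U}^2}. \tag{77}$$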



By multiplying P(u|x) with P(y|u, x) and integrating over 
u, we may obtain the last of the needed conditional PDFs, 
namely P(y|x), 

$$P(y|x) = \frac{1}{\sqrt{2\pi\sigma_{Y|X}^2}} \exp\left[-\frac{(y - ax - b\bar U_{|x})^2}{2\sigma_{Y|X}^2}\right], \tag{78}$$



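where the conditional mean and variance may be expressed as 

$$\begin{aligned}
E(Y|x) &= \left(a + \frac{\alpha b\,\sigma_U^2}{\sigma_{X|U}^2 + \alpha^2\sigma_U^2}\right) x + \frac{b\bar U\sigma_{X|U}^2}{\sigma_{X|U}^2 + \alpha^2\sigma_U^2}, \\
\sigma_{Y|X}^2 &= \sigma_N^2 + \frac{b^2\sigma_{X|U}^2\sigma_U^2}{\sigma_{X|U}^2 + \alpha^2\sigma_U^2}.
\end{aligned} \tag{79}$$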



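We are now in a position to write down both $I^{(+)}(X;Y)$ 
and $I^{(-)}(X;Y)$ by use of the Gaussian-channel entropy 
formula in terms of the variances of P(y), P(y|x), P(y|u), 
and P(y|x, u), 

$$\begin{aligned}
I^{(+)}(X;Y) &= I(X;Y) = \frac{1}{2} \ln\left[\frac{\sigma_N^2 + a^2\sigma_{X|U}^2 + (b + a\alpha)^2\sigma_U^2}{\sigma_N^2 + b^2\sigma_{U|x}^2}\right], \\
I^{(-)}(X;Y) &= I(X;Y|U) = \frac{1}{2} \ln\left(1 + \frac{a^2\sigma_{X|U}^2}{\sigma_N^2}\right).
\end{aligned} \tag{80}$$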



Note that in the limit of $\sigma_{X|U} \to 0$, the two variables, X and 
U, are infinitely tightly coupled. In effect, $X = \alpha U$, and the 
data Y carry no information about X when U is held fixed. 
This is seen in the relation (80), in which $I^{(-)}(X;Y)$ vanishes 
in this limit. By contrast, $I^{(+)}(X;Y)$ is 
finite in this limit, and thus trivially exceeds $I^{(-)}(X;Y)$. The 
other limit, $\alpha \to 0$, returns us to the case of uncorrelated 
X and U variables for which the results (72) are recouped 
and $I^{(-)}(X;Y)$ exceeds $I^{(+)}(X;Y)$. The more general cases 
in which neither of these limits is a good approximation are 
illustrated in Fig. 3. 
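
The two limits just described can be reproduced with a few lines of Python. This is a minimal sketch assuming the variance expressions implied by the model, namely σ_X² = σ²_{X|U} + α²σ_U² and σ²_{U|x} = σ_U²σ²_{X|U}/σ_X². The function name `mi_example2` and the numerical values are hypothetical.

```python
import math

def mi_example2(a, b, alpha, var_x_given_u, var_u, var_n):
    """Return (I_plus, I_minus) for Y = aX + bU + N with X|U ~ N(alpha*U, var_x_given_u)."""
    var_x = var_x_given_u + alpha ** 2 * var_u          # marginal variance of X
    var_u_given_x = var_u * var_x_given_u / var_x       # posterior variance of U given X
    var_y = var_n + a ** 2 * var_x_given_u + (a * alpha + b) ** 2 * var_u
    var_y_given_x = var_n + b ** 2 * var_u_given_x
    i_plus = 0.5 * math.log(var_y / var_y_given_x)      # I(X;Y)
    i_minus = 0.5 * math.log(1.0 + a ** 2 * var_x_given_u / var_n)  # I(X;Y|U)
    return i_plus, i_minus

# Tight-coupling limit (var_x_given_u = 0): I_minus vanishes, I_plus stays finite
tight_plus, tight_minus = mi_example2(1.0, 1.0, 2.0, 0.0, 1.0, 1.0)

# Uncorrelated limit (alpha = 0): reduces to the first example, with I_minus > I_plus
unc_plus, unc_minus = mi_example2(1.0, 1.0, 0.0, 3.0, 2.0, 1.0)

print(tight_plus, tight_minus)
print(unc_plus, unc_minus)
```

In the tight-coupling case the sketch gives I⁽⁺⁾ = (1/2) ln 10 and I⁽⁻⁾ = 0, while the uncorrelated case returns the values (1/2) ln 2 and ln 2 of the first example.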

I plot here $I^{(+)}(X;Y)$ (solid curves) and $I^{(-)}(X;Y)$ 
(dashed curves) for the case of normalized U variance, $S_U = 
a^2\sigma_U^2/\sigma_N^2$, equal to 5. Each variance is normalized the same 
way by multiplying it with $a^2/\sigma_N^2$. Six different values of $\alpha$, 
which determines the largeness of the conditional mean of X, 
given U, were used to generate the various curves. By 
contrast, $I^{(-)}$ is independent of $\alpha$, as seen from the single 
dashed curve on each plot. A number of observations can 
be made from these plots. First, for $\alpha = 0$, the variables 
X and U are uncorrelated, so in this case the plot of MI in 
the presence of the nuisance variable, U, lies below that for 
MI when uninfluenced by the nuisance variable. Second, as 
$\alpha$ increases, the coupling of X and U becomes increasingly 
less sensitive to the noise in U. This means that when Y is 
measured, its value reveals more information about X than 
when $\alpha$ is smaller. When the nuisance is removed, i.e., U is 
held fixed, then a change of $\alpha$ merely changes the mean value 
of X, leaving its variance unchanged, which is the reason why 
$I^{(-)}$ depends neither on $\alpha$ nor on $\sigma_U$. Third, as the relative 
strength, b/a, of the nuisance increases while $\alpha$ is held fixed, 
$I^{(+)}$ increases initially since the data Y possess an increasing 
amount of information about X through the latter's coupling to 
U. However, increasing the strength of the bU term in Eq. (69) 
to large values leads to the data becoming more corrupted than 
helped by the nuisance, which leads to an eventual decrease of 
the MI. These two competing tendencies lead to a maximum 
for each curve (left top and bottom), with the location of the 
maxima shifting to larger nuisance-parameter strength values 
with increasing $\alpha$. Fourth, comparing the plots for the smaller 
vs. larger values of $S_{X|U}$ (left panels), we see that the tighter 
the X-U coupling the softer the degradation of $I^{(+)}$ with 
increasing strength of the nuisance parameter. Finally, as seen 
from the right-hand panels of the figure, an infinitely tight 
coupling between X and U (for $S_{X|U} \to 0$) yields, through 
the sensitivity of data to U, information about X as well. This 
information about X degrades when $S_{X|U}$ increases to finite 
values, the more so the smaller the parameter $\alpha$. 

Fig. 3. Plots of $I^{(\pm)}(X;Y)$ vs. relative strength, b/a, of the nuisance parameter (left top and bottom panels) and vs. weakness of coupling between X 
and U, as measured by $a^2\sigma_{X|U}^2/\sigma_N^2$ (right top and bottom panels). The bottom panels refer to a tighter X-U coupling (left panel) and larger nuisance 
parameter strength (right panel) than the corresponding figures in the top panels. 



VIII. MMSE in the Presence of Nuisance Parameters 

When multiple input parameters must all be estimated from 
the same measurement(s), one expects the MMSE, like the 
CRB, for estimating any of the parameters to be higher than 
if the others were not present. I prove this result next. 

Let X, U be two input parameters to be estimated from 
data Y. Let P(x, u) be the joint prior on the inputs. The 
MMSE estimator for X in the absence of the nuisance U 
can be defined in terms of the conditional MMSE estimator, 

$$\hat X_u(Y) = \int x\,P(x|Y, U = u)\,dx, \tag{81}$$

given U = u. It is the MMSE estimator of X for a given value 
of U. Its mean squared value has an expression analogous to 
that found earlier, 

$$\begin{aligned}
E[\hat X_U^2(Y)] &= \int du\,P(u) \iint dx\,dx'\,x\,x'\,P(x|u)\,P(x'|u) \int dy\,\frac{P(y|x,u)\,P(y|x',u)}{P(y|u)} \\
&= \int \frac{dy}{P(y)} \int du\,K^2(y,u) \int du\,P(y|u)\,P(u),
\end{aligned} \tag{82}$$

where K(y, u) stands for the function 

$$K(y,u) = \sqrt{\frac{P(u)}{P(y|u)}} \int x\,P(y|x,u)\,P(x|u)\,dx. \tag{83}$$

We also used the Bayes-rule identity, $\int P(y|u)\,P(u)\,du = 
P(y)$, to arrive at the last line of Eq. (82). 

A use of the Cauchy-Schwarz inequality in Eq. (82) shows 
that the mean squared value of the MMSE estimator in the 
absence of nuisance has the lower bound 

$$\begin{aligned}
E[\hat X_U^2(Y)] &\ge \int \frac{dy}{P(y)} \left[\int du\,K(y,u)\,\sqrt{P(y|u)\,P(u)}\right]^2 \\
&= \int \frac{dy}{P(y)} \left[\iint dx\,du\,x\,P(x|u)\,P(y|x,u)\,P(u)\right]^2 \\
&= \int \frac{dy}{P(y)} \left[\int dx\,x\,P(x,y)\right]^2 \\
&= E[\hat X^2(Y)],
\end{aligned} \tag{84}$$

where a simple substitution of K(y, u) from Eq. (83) was 
used to obtain the second relation, Bayes rule to obtain the 
third relation, and the definition of the MMSE estimator $\hat X(Y)$ 
as the posterior mean of X, namely $\int dx\,x\,P(x,y)/P(y)$, to 
arrive at the final relation. Since the MMSE, as we have noted 
earlier, may be expressed simply as the mean squared value 
of X minus the mean squared value of the MMSE estimator, 
the desired inequality between the MMSE without and with 
nuisance follows immediately, 

$$\mathrm{MMSE}^{(-)}(X) \le \mathrm{MMSE}^{(+)}(X). \tag{85}$$

We deduce from this important result that the presence of 
the nuisance parameter can never lower the MMSE below 
that obtained in its absence, i.e., when the nuisance has a 
known value, regardless of whether the priors on X and the 
nuisance U are statistically correlated or not. This seems to 
exclude the possibility that U, if suitably correlated with X, 
may serve, as we observed in Sec. VII in the context of MI, 
as a source of additional information for X. The answer to this 
apparent paradox may be found in the way MMSE is defined. 
Since, given a value of the nuisance u, the MMSE estimator 
minimizes the MSE relative to the corresponding conditional 
prior, P(x|u), on X and measurement PDF P(y|x, u), the 
nuisance-averaged MMSE is not characterizable as the MSE 
for a single, nuisance-averaged MMSE estimator. The MMSE 
metric thus may not possess the same degree of specificity 
as the MI or FI metrics when the effect of nuisance must be 
quantified. 

A. Gaussian Channel and Gaussian Prior 

We now illustrate the effect of nuisance on the MMSE with 
our previous example of a Gaussian channel for which some 
of the relevant PDFs are given in Eqs. (73)-(78). What we 
need are the MMSE$^{(\pm)}$ estimators, namely $\hat X_u(Y)$ given by 
expression (81), and $\hat X(Y)$ by (fJTJ. As is well known from 
the theory of MMSE iTPfl for Gaussian priors and Gaussian 
channel PDFs, each MMSE estimator may be expressed as the 
inverse-variance-weighted sum of its prior and measurement 
based estimates, 



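$$\begin{aligned}
\hat X_u(Y) &= f^{(-)} \left[\frac{a\,(Y - bu)}{\sigma_N^2} + \frac{\alpha u}{\sigma_{X|U}^2}\right], \\
\hat X(Y) &= f^{(+)} \left[\frac{\left(a + \alpha b\,\sigma_U^2/\sigma_X^2\right)\left(Y - b\bar U\sigma_{X|U}^2/\sigma_X^2\right)}{\sigma_{Y|X}^2} + \frac{\alpha\bar U}{\sigma_X^2}\right],
\end{aligned} \tag{86}$$

where the multipliers are given by 

$$\frac{1}{f^{(-)}} = \frac{a^2}{\sigma_N^2} + \frac{1}{\sigma_{X|U}^2}, \qquad \frac{1}{f^{(+)}} = \frac{\left(a + \alpha b\,\sigma_U^2/\sigma_X^2\right)^2}{\sigma_{Y|X}^2} + \frac{1}{\sigma_X^2}, \tag{87}$$

and the unconditional X-variance, $\sigma_X^2$, may be expressed as 

$$\sigma_X^2 = \sigma_{X|U}^2 + \alpha^2\sigma_U^2. \tag{88}$$

The various data and prior based estimates and variances used 
in arriving at the expressions (86) have been inferred from the 
mean values and variances of the PDFs given in Eqs. (73)-(79). 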

c) MMSE in the Absence of Nuisance: To compute the 
mean squared values of these estimators, we first subtract 
and add the appropriate mean value of Y from Y in the 
expressions (86) and then use the fact that $E[(\delta Y + q)^2] = 
E[(\delta Y)^2] + q^2$, where $\delta Y$ is the deviation of Y from its mean 
and q is any quantity independent of Y. For the MMSE$^{(-)}$ 
estimator, the mean we subtract and add is the conditional 
mean of Y, given u, namely $(a\alpha + b)u$, so the following 
conditional mean squared value for it, given u, results: 

$$E[\hat X_u^2(Y)|u] = f^{(-)2}\,\frac{a^2}{\sigma_N^4}\left(\sigma_N^2 + a^2\sigma_{X|U}^2\right) + \alpha^2 u^2. \tag{89}$$



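An averaging of this expression over u with the help of the 
result $E(U^2) = \bar U^2 + \sigma_U^2$ then yields the required mean squared 
value of the MMSE$^{(-)}$ estimator. Subtracting this mean 
squared value from $E(X^2)$, the latter being simply $\alpha^2\bar U^2 + \sigma_X^2$, 
generates, according to Eq. (22), the MMSE$^{(-)}$, 

$$\begin{aligned}
\mathrm{MMSE}^{(-)} &= \sigma_X^2 - \alpha^2\sigma_U^2 - f^{(-)2}\,\frac{a^2}{\sigma_N^4}\left(\sigma_N^2 + a^2\sigma_{X|U}^2\right) \\
&= \frac{\sigma_{X|U}^2\,\sigma_N^2}{a^2\sigma_{X|U}^2 + \sigma_N^2},
\end{aligned} \tag{90}$$

where use was made of relation (88) in the second line. 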

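d) MMSE in the Presence of Nuisance: Subtracting and 
adding the mean value of Y, namely $(a\alpha + b)\bar U$, from Y inside 
the expression (86) for the estimator $\hat X(Y)$ and then squaring 
and averaging over Y generates the following mean squared 
value of the MMSE estimator in the presence of nuisance: 

$$\begin{aligned}
E[\hat X^2(Y)] &= f^{(+)2}\,\frac{\left(a + \alpha b\,\sigma_U^2/\sigma_X^2\right)^2}{\sigma_{Y|X}^4}\,\sigma_Y^2 + f^{(+)2}\left[\frac{\left(a + \alpha b\,\sigma_U^2/\sigma_X^2\right)^2}{\sigma_{Y|X}^2} + \frac{1}{\sigma_X^2}\right]^2 \alpha^2\bar U^2 \\
&= \frac{\left(a\sigma_X^2 + \alpha b\,\sigma_U^2\right)^2 \sigma_Y^2}{\left[\sigma_{Y|X}^2 + \left(a\sigma_X^2 + \alpha b\,\sigma_U^2\right)^2/\sigma_X^2\right]^2} + \alpha^2\bar U^2,
\end{aligned} \tag{91}$$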

where use was made of the definition (87) of $f^{(+)}$ to simplify 
both terms on the RHS. In view of relations (79) for the 
conditional mean and variance of Y, given X = x, and the fact 
that all PDFs are Gaussian, we may express the unconditional 
variance of Y, namely $\sigma_Y^2$, as the sum of the conditional variance, 
$\sigma_{Y|X}^2$, given X, and $(a + \alpha b\,\sigma_U^2/\sigma_X^2)^2$ times $\sigma_X^2$. This 
observation greatly simplifies the preceding expression, 

$$E[\hat X^2(Y)] = \alpha^2\bar U^2 + \frac{\left(a\sigma_X^2 + \alpha b\,\sigma_U^2\right)^2}{\sigma_Y^2}. \tag{92}$$



The MMSE now follows from subtracting expression (92) 
from $E(X^2) = \alpha^2\bar U^2 + \sigma_X^2$, a result that can be simplified 
further in view of the relation between $\sigma_Y^2$ and $\sigma_{Y|X}^2$ that we 
just noted in the previous paragraph, 

$$\mathrm{MMSE}^{(+)} = \frac{\sigma_{Y|X}^2\,\sigma_X^2}{\sigma_Y^2}. \tag{93}$$



By using relations (79), (77), the alternate form of $\sigma_Y^2$ given 
by relation (74), and $\sigma_X^2 = \sigma_{X|U}^2 + \alpha^2\sigma_U^2$, we may express the 
MMSE in the presence of nuisance in the more explicit form 

$$\mathrm{MMSE}^{(+)} = \frac{\sigma_N^2\sigma_{X|U}^2 + \alpha^2\sigma_N^2\sigma_U^2 + b^2\sigma_{X|U}^2\sigma_U^2}{\sigma_N^2 + a^2\sigma_{X|U}^2 + (a\alpha + b)^2\sigma_U^2}. \tag{94}$$



In this form, we may easily compare MMSE$^{(+)}$ to the 
corresponding result (90) for MMSE in the absence of nuisance. A 
sequence of steps involving simple algebraic manipulations, 
followed by a use of the inequality $f^2 + g^2 \ge 2fg$, easily 
confirms the general result proved earlier that the presence 
of nuisance parameters can never reduce the MMSE for the 
estimation of the parameter of interest, 

$$\mathrm{MMSE}^{(+)} - \mathrm{MMSE}^{(-)} \ge 0. \tag{95}$$



This result is illustrated in Fig. 4 where we plot both 
MMSE$^{(\pm)}$, in units of $\sigma_N^2/a^2$, as functions of the variable 
$\chi = a^2\sigma_{X|U}^2/\sigma_N^2$ for different values of the nuisance coupling 
parameter, $\eta = a\alpha/b$. The reciprocal of $\chi$ is a measure 
of the strength of the statistical correlation between X and 
nuisance U, while $\eta$ represents the ability of nuisance to carry 
information about X through its statistical correlations with 
X. As $\chi$ becomes larger, the prior on X becomes broader 
and the measurement becomes increasingly more dominant 
in controlling the MMSE whether the nuisance is absent or 
present. But, as expected, MMSE$^{(-)}$ does not depend on 
the coupling parameter $\eta$ or the nuisance-parameter SNR 
defined as $\mathrm{SNR}_U = b^2\sigma_U^2/\sigma_N^2$. On the other hand, MMSE$^{(+)}$ 
decreases with increasing $\eta$ since the nuisance becomes 
increasingly more effective - and the data Y increasingly less 
so - in controlling the MSE. With increasing $\mathrm{SNR}_U$, from 10 to 
100 between the two panels of the figure, the nuisance causes 
an increased error in estimating X, as its increased variance 
leads to an increased variance of the prior on X. But in no 
event does the MMSE in the presence of nuisance fall below 
the MMSE without nuisance. The optimal condition under 
which the presence of nuisance does not degrade the MMSE, 
i.e., MMSE$^{(+)}$ = MMSE$^{(-)}$, is achieved when $\chi = \eta$, as seen 
from the figures and as can also be easily shown analytically 
from the expressions (90) and (94). 
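
The comparison between the two MMSE expressions can also be sketched numerically. The Python snippet below evaluates the closed forms for the MMSE without and with nuisance, written in terms of the variances σ²_{X|U}, σ_U², and σ_N², and checks that the gap is non-negative and closes when χ = η. The function names and parameter values are hypothetical and chosen only for illustration.

```python
def mmse_without_nuisance(a, var_x_given_u, var_n):
    """MMSE for X when U is held at a known value."""
    return var_x_given_u * var_n / (a ** 2 * var_x_given_u + var_n)

def mmse_with_nuisance(a, b, alpha, var_x_given_u, var_u, var_n):
    """MMSE for X when U is present and not known."""
    num = (var_n * var_x_given_u + alpha ** 2 * var_n * var_u
           + b ** 2 * var_x_given_u * var_u)
    den = var_n + a ** 2 * var_x_given_u + (a * alpha + b) ** 2 * var_u
    return num / den

# Hypothetical values
a, b, var_u, var_n, var_x_given_u = 1.0, 2.0, 3.0, 1.0, 1.0
chi = a ** 2 * var_x_given_u / var_n        # = 1 for these values

gaps = {}
for alpha in (0.0, 0.5, 2.0, 4.0):
    eta = a * alpha / b                     # nuisance coupling parameter
    gaps[alpha] = (mmse_with_nuisance(a, b, alpha, var_x_given_u, var_u, var_n)
                   - mmse_without_nuisance(a, var_x_given_u, var_n))
    print(f"eta = {eta:.2f}, chi = {chi:.2f}, MMSE gap = {gaps[alpha]:.6f}")
```

Carrying out the subtraction by hand shows that the numerator of the gap is proportional to (ασ_N² − abσ²_{X|U})², which is the f² + g² ≥ 2fg step in explicit form; it vanishes exactly when χ = η, here at α = 2.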



IX. Conclusions 

In this paper I have derived a number of previously unknown 
relationships between mutual information and the minimum 
error of estimating a parameter from its measurements. A 
second-order linear relation between MI and a prior-averaged, 
squared-deviation-weighted form of the FI accords added 
significance to the word "information" when describing the 
latter, even though its chief claim to this word has been in 
the sense of being the reciprocal of estimation error. 

A second, more important relation between information and 
estimation error has been obtained in the fully Bayesian 
context of minimum mean squared error. I have shown, in 
particular, that the Shannon equivocation, h(X|Y), in the differential 
sense cannot exceed $(1/2)\ln(2\pi e\,\mathrm{MMSE})$, and hence the MI 
is bounded below by $h(X) - (1/2)\ln(2\pi e\,\mathrm{MMSE})$. 
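
As a quick sanity check of this lower bound, one may consider the scalar Gaussian channel Y = aX + N with a Gaussian prior and no nuisance, for which the MI, the differential entropy h(X), and the MMSE all have well-known closed forms. The Python sketch below uses hypothetical parameter values and illustrative function names; in this fully Gaussian case the bound is met with equality.

```python
import math

def gaussian_mi(a, var_x, var_n):
    """I(X;Y) in nats for Y = aX + N with X ~ N(., var_x), N ~ N(0, var_n)."""
    return 0.5 * math.log(1.0 + a ** 2 * var_x / var_n)

def gaussian_mmse(a, var_x, var_n):
    """MMSE of the posterior-mean estimator of X for the same channel."""
    return var_x * var_n / (a ** 2 * var_x + var_n)

def mmse_lower_bound_on_mi(var_x, mmse):
    """h(X) - (1/2) ln(2 pi e MMSE) for a Gaussian prior of variance var_x."""
    h_x = 0.5 * math.log(2.0 * math.pi * math.e * var_x)
    return h_x - 0.5 * math.log(2.0 * math.pi * math.e * mmse)

# Hypothetical values
a, var_x, var_n = 1.5, 2.0, 0.5
mi = gaussian_mi(a, var_x, var_n)
bound = mmse_lower_bound_on_mi(var_x, gaussian_mmse(a, var_x, var_n))
print(mi, bound)  # the two agree for a Gaussian prior and Gaussian channel
```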

Both these results were generalized to the case of MIMO 
channels. However, the MMSE-based lower bound on MI is 
not easily extendable to the discrete case. (I exclude here 
the trivial construct of associating with the PDF P(x) of 
a continuous random parameter X a discrete PD involving 
probabilities $\{p_i = P(x_i)\,\Delta x\}$ computed for finite bins, 
centered at regularly spaced points $x_i$ that are separated by 
an interval $\Delta x$ small compared to the scale over which P(x) 
varies significantly.) 

If additional input variables other than those of interest 
to the estimation problem are present, in general they 
serve to compromise the fidelity with which the variables 
of interest may be estimated. The impact of such nuisance 
variables on estimator performance was elucidated here with 
formulations based separately on MI, FI, and MMSE, and a 
number of important inequalities were derived that provide 
valuable insight into information and error-based metrics of 
performance. The MMSE-based description of the nuisance 
is particularly intriguing since it seems to predict the nearly 
counter-intuitive result that the presence of nuisance can never 
improve performance, even when it is strongly coupled to 
the input and has vanishing variance, i.e., independent of its 
statistical correlations with the input. This may be a peculiarity 
of how MMSE is defined, but it surely deserves additional 
consideration. 